For my project, I used k-Means clustering on a dataset of statistics from the 2016 primaries. I first processed the dataset to keep only Donald Trump's results in each county. I then normalized this data and ran a k-Means algorithm on it to see what groups of people voted for him.
In [85]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
Importing the .csv files and processing them into a single dataframe.
In [86]:
# process the .csv containing county statistics
counties = pd.read_csv('county_facts.csv')
drop_columns = ["state_abbreviation", "fips"]
counties.drop(drop_columns, inplace=True, axis=1)
# combine it with the .csv containing primary statistics
# (pd.concat with axis=1 aligns the two dataframes by row index)
primary = pd.read_csv('primary_results.csv')
primary = pd.concat([primary, counties], axis=1)
trump = primary[primary['candidate'] == 'Donald Trump'].sort_index()
# drop the features we don't need
drop_columns = ["state_abbreviation", "party", "candidate", "area_name"]
trump.drop(drop_columns, inplace=True, axis=1)
# get rid of counties with no statistical data
trump = trump.fillna(0.0)
trump = trump[trump['POP010210'] > 0]
trump.head()
Out[86]:
This creates a dataframe containing Trump's results for every county with statistical data. Now the data has to be normalized.
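Each feature is rescaled to the range [0, 1] with min-max normalization, i.e. (x - min) / (max - min). The next cell applies this inline to each feature; the same idea could be written as a small helper function (a hypothetical min_max_norm, shown here for illustration and not used below):

def min_max_norm(series):
    # scale a column to [0, 1] and reshape it into a column vector for stacking
    scaled = (series - series.min()) / (series.max() - series.min())
    return np.array(scaled).reshape(-1, 1)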
In [87]:
state = trump["state"]
county = trump["county"]
# any of the features in the trump dataframe can be used, these were chosen because they seemed interesting
# percent who voted for donald trump
fraction_votes = trump["fraction_votes"]
fraction_votes_norm = np.array((fraction_votes - fraction_votes.min()) / (fraction_votes.max() - fraction_votes.min())).reshape(-1,1)
# median household income of the country
median_income = trump["INC110213"]
median_income_norm = np.array((median_income - median_income.min()) / (median_income.max() - median_income.min())).reshape(-1,1)
# percent of people in the county who were born outside of america
foreign_born = trump["POP645213"]
foreign_born_norm = np.array((foreign_born - foreign_born.min()) / (foreign_born.max() - foreign_born.min())).reshape(-1,1)
# percent of people in the county who graduated high school
high_school = trump["EDU635213"]
high_school_norm = np.array((high_school - high_school.min()) / (high_school.max() - high_school.min())).reshape(-1,1)
# percent of people in the county with a bachelors degree
bachelors = trump["EDU685213"]
bachelors_norm = np.array((bachelors - bachelors.min()) / (bachelors.max() - bachelors.min())).reshape(-1,1)
# the features to be used in k-Means are added to 2-D arrays
trump_norm = np.hstack((high_school_norm, median_income_norm))
Graphs showing the relationships between some of these features and the election results are displayed below.
In [88]:
# graphs of the normalized data
f, axarr = plt.subplots(2, 2)
axarr[0,0].set_title('Income and Trump Votes')
axarr[0,0].scatter(median_income_norm, fraction_votes_norm, c='red')
axarr[0,1].set_title('Foreigners and Trump Votes')
axarr[0,1].scatter(foreign_born_norm, fraction_votes_norm, c='green')
axarr[1,0].set_title('College and Trump Votes')
axarr[1,0].scatter(bachelors_norm, fraction_votes_norm, c='blue')
axarr[1,1].set_title('High School and Trump Votes')
axarr[1,1].scatter(high_school_norm, fraction_votes_norm, c='yellow')
plt.setp([a.get_xticklabels() for a in axarr[0, :]], visible=False)
plt.setp([a.get_yticklabels() for a in axarr[:, 1]], visible=False)
plt.show()
With the data normalized, the k-Means algorithm can be run on it. To find a good number of clusters, the average silhouette score was calculated for each candidate value.
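For reference, the silhouette score of a single sample $i$ compares $a(i)$, its mean distance to the other points in its own cluster, with $b(i)$, its mean distance to the points in the nearest other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

silhouette_score averages $s(i)$ over all samples, so values near 1 indicate compact, well-separated clusters.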
In [89]:
best_nc = 0
best_ss = 0
for n_clusters in range(2, 10):
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(trump_norm)
    silhouette_avg = silhouette_score(trump_norm, cluster_labels)
    print("For", n_clusters, "clusters, the average silhouette score is", silhouette_avg)
    if silhouette_avg > best_ss:
        best_nc = n_clusters
        best_ss = silhouette_avg
print("The best number of clusters is", best_nc)
With the best-scoring number of clusters, the final model can be fit.
In [90]:
kmeans = KMeans(n_clusters=best_nc, random_state=10)
kmeans.fit(trump_norm)
Out[90]:
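Since silhouette_samples was imported earlier, the per-sample scores for the fitted model can also be inspected to see how tightly each cluster holds together. A minimal sketch, assuming the kmeans model and best_nc from the cells above are in scope:

# per-cluster mean silhouette for the fitted model (illustrative, not part of the original analysis)
labels = kmeans.labels_
sample_scores = silhouette_samples(trump_norm, labels)
for i in range(best_nc):
    # low per-cluster means flag loosely grouped clusters
    print("Cluster", i, "mean silhouette:", sample_scores[labels == i].mean())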
The results of the k-Means algorithm are plotted below. The plotting code is adapted from the scikit-learn documentation.
In [91]:
h = 0.02
x_min, x_max = trump_norm[:, 0].min() - 1, trump_norm[:, 0].max() + 0.5
y_min, y_max = trump_norm[:, 1].min() - 1, trump_norm[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# obtain labels for each point in the mesh, using the last trained model
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
# put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(trump_norm[:, 0], trump_norm[:, 1], 'k.', markersize=2)
# plot the centroids as white X's
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=100, linewidths=3,
            color='w', zorder=10)
plt.title('k-Means Clustering on Primary Results')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.show()
The two clusters at the top and bottom are spread out and include many outliers. The middle cluster is the densest and likely represents the typical Trump-supporting county. From this, it appears that counties with Trump supporters tend to have average to below-average median household incomes and average to slightly below-average high school graduation rates.
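As a possible next step, the state and county columns saved earlier could be joined back to the cluster labels to see which counties fall in each group. A minimal sketch, assuming the variables from the cells above are still in scope:

# attach cluster labels back to county names (illustrative sketch)
results = pd.DataFrame({
    'state': state.values,
    'county': county.values,
    'cluster': kmeans.labels_,
})
# show a few example counties from each cluster
for i in range(best_nc):
    print(results[results['cluster'] == i].head(3))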